Multipath Translation Lexicon Induction via Bridge Languages

Authors

  • Gideon S. Mann
  • David Yarowsky
Abstract

Gideon S. Mann and David Yarowsky, Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218 USA, {gsm,yarowsky}@cs.jhu.edu

This paper presents a method for inducing translation lexicons based on transduction models of cognate pairs via bridge languages. Bilingual lexicons within language families are induced using probabilistic string edit distance models. Translation lexicons for arbitrary distant language pairs are then generated by a combination of these intra-family translation models and one or more cross-family online dictionaries. Up to 95% exact match accuracy is achieved on the target vocabulary (30-68% of inter-family test pairs). Thus substantial portions of translation lexicons can be generated accurately for languages where no bilingual dictionary or parallel corpora may exist.

1 Translation Lexicons, Cognates, and Bridge Languages

A translation lexicon is a mapping from words in one language (the source) to words in another language (the target). For each word in the source, this dictionary provides one or more words in the target which might be appropriate translations in some context. Such a lexicon is the foundation of any machine translation system.

Translation lexicons are available online for many of the world's major languages, but they are often quite limited and may have intellectual property constraints. For lower-density languages, translation lexicons typically exist only as a hardcopy dictionary (if at all). Creating a translation lexicon from scratch requires time-consuming work by experts trained in both languages. Automatic methods to generate even partial dictionaries would significantly decrease the human effort needed to build machine translation systems for less heavily supported languages.

In this paper, we explore algorithms for building lexicons between arbitrary languages using models of cognate pairs and cognate distance.

We define a cognate pair as a translation pair where words from two languages share both meaning and a similar surface form. Cognate pairs usually arise when both words are derived from an ancestral root form (e.g. "neveu" [Fr.], "nephew" [Eng.]) (Buck,

[Figure labels: Portuguese, Italian, French, Romanian; via dictionary]
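The bridge-language idea described in the abstract can be illustrated with a small sketch. This is not the authors' model: the paper uses probabilistic string edit distance models, whereas the code below substitutes a plain weighted edit distance with hand-set costs. The function names, the vowel-discount cost, the `max_dist` threshold, and the toy Portuguese/Spanish/English data are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

VOWELS = set("aeiou")


def sub_cost(a: str, b: str) -> float:
    """Hypothetical substitution cost: vowel-for-vowel swaps are cheaper."""
    if a == b:
        return 0.0
    if a in VOWELS and b in VOWELS:
        return 0.5
    return 1.0


def weighted_edit_distance(s: str, t: str, indel: float = 1.0) -> float:
    """Dynamic-programming edit distance with per-pair substitution costs."""
    prev = [j * indel for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [i * indel]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + indel,                # delete a from s
                curr[j - 1] + indel,            # insert b into s
                prev[j - 1] + sub_cost(a, b),   # substitute a with b
            ))
        prev = curr
    return prev[-1]


def best_cognate(word: str, bridge_vocab: List[str]) -> Tuple[str, float]:
    """Return the bridge-language word closest to `word`, with its distance."""
    scored = ((w, weighted_edit_distance(word, w)) for w in bridge_vocab)
    return min(scored, key=lambda pair: pair[1])


def translate_via_bridge(
    word: str,
    bridge_vocab: List[str],
    bridge_to_target: Dict[str, List[str]],
    max_dist: float = 2.0,  # assumed cutoff; rejects implausible cognates
) -> Optional[List[str]]:
    """Source word -> nearest bridge cognate -> dictionary lookup in target."""
    candidate, dist = best_cognate(word, bridge_vocab)
    if dist > max_dist:
        return None
    return bridge_to_target.get(candidate)


# Toy data (illustrative only): Portuguese -> Spanish (bridge) -> English.
spanish_vocab = ["sobrino", "sombrero", "noche"]
spanish_to_english = {
    "sobrino": ["nephew"],
    "sombrero": ["hat"],
    "noche": ["night"],
}

result = translate_via_bridge("sobrinho", spanish_vocab, spanish_to_english)
print(result)
```

In this sketch the first step stays within a language family, where surface similarity is informative, and the second step crosses families through an existing dictionary. A learned, probabilistic cost model would replace `sub_cost` in a setting closer to the one the abstract describes.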

Similar resources

Annotating Cognates and Etymological Origin in Turkic Languages

Turkic languages exhibit extensive and diverse etymological relationships among lexical items. These relationships make the Turkic languages promising for exploring automated translation lexicon induction by leveraging cognate and other etymological information. However, due to the extent and diversity of the types of relationships between words, it is not clear how to annotate such information...

Bilingual Lexicon Induction for Low-resource Languages

Statistical machine translation relies on the availability of substantial amounts of human translated texts. Such bilingual resources are available for relatively few language pairs, which presents obstacles to applying current statistical translation models to low-resource languages. In this work, we induce bilingual dictionaries from more plentiful monolingual corpora using a diverse set of c...

A language-independent and fully unsupervised approach to lexicon induction and part-of-speech tagging for closely related languages

In this paper, we describe our generic approach for transferring part-of-speech annotations from a resourced language towards an etymologically closely related non-resourced language, without using any bilingual (i.e., parallel) data. We first induce a translation lexicon from monolingual corpora, based on cognate detection followed by cross-lingual contextual similarity. Second, POS informatio...

A Comprehensive Analysis of Bilingual Lexicon Induction

Bilingual lexicon induction is the task of inducing word translations from monolingual corpora in two languages. In this paper we present the most comprehensive analysis of bilingual lexicon induction to date. We present experiments on a wide range of languages and data sizes. We examine translation into English from 25 foreign languages: Albanian, Azeri, Bengali, Bosnian, Bulgarian, Cebuano, G...

Publication date: 2001